Leveraging Small Multilingual Corpora for SMT Using Many Pivot Languages
نویسندگان
چکیده
We present our work on leveraging multilingual parallel corpora of small sizes for Statistical Machine Translation between Japanese and Hindi using multiple pivot languages. In our setting, the source and target part of the corpus remains the same, but we show that using several different pivot to extract phrase pairs from these source and target parts lead to large BLEU improvements. We focus on a variety of ways to exploit phrase tables generated using multiple pivots to support a direct source-target phrase table. Our main method uses the Multiple Decoding Paths (MDP) feature of Moses, which we empirically verify as the best compared to the other methods we used. We compare and contrast our various results to show that one can overcome the limitations of small corpora by using as many pivot languages as possible in a multilingual setting. Most importantly, we show that such pivoting aids in learning of additional phrase pairs which are not learned when the direct sourcetarget corpus is small. We obtained improvements of up to 3 BLEU points using multiple pivots for Japanese to Hindi translation compared to when only one pivot is used. To the best of our knowledge, this work is also the first of its kind to attempt the simultaneous utilization of 7 pivot languages at decoding time.
منابع مشابه
Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features...
متن کاملDialect Translation: Integrating Bayesian Co-segmentation Models with Pivot-based SMT
Recent research on multilingual statistical machine translation (SMT) focuses on the usage of pivot languages in order to overcome resource limitations for certain language pairs. This paper proposes a new method to translate a dialect language into a foreign language by integrating transliteration approaches based on Bayesian co-segmentation (BCS) models with pivot-based SMT approaches. The ad...
متن کاملMMCR4NLP: Multilingual Multiway Corpora Repository for Natural Language Processing
Multilinguality is gradually becoming ubiquitous in the sense that more and more researchers have successfully shown that using additional languages help improve the results in many Natural Language Processing tasks. Multilingual Multiway Corpora (MMC) contain the same sentence in multiple languages. Such corpora have been primarily used for Multi-Source and Pivot Language Machine Translation b...
متن کاملMorphological Constraints for Phrase Pivot Statistical Machine Translation
The lack of parallel data for many language pairs is an important challenge to statistical machine translation (SMT). One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations especially when a poor morphology language is used as the pi...
متن کاملStatistical Machine Translation without a Source-side Parallel Corpus Using Word Lattice and Phrase Extension
Statistical machine translation (SMT) requires a parallel corpus between the source and target languages. Although a pivot-translation approach can be applied to a language pair that does not have a parallel corpus directly between them, it requires both source–pivot and pivot–target parallel corpora. We propose a novel approach to apply SMT to a resource-limited source language that has no par...
متن کامل